ABSTRACT

The problems of validity and fairness involved in 
multiple-choice critical thinking tests can be lessened by using 
verbal reports of examinees' thinking during the process of
developing such tests in order to retain only those items which rely
on critical thinking skills to obtain the correct answer.
Multiple-choice testing can lead to unfair treatment of students, but
disqualifying these tests can result in less powerful assessments of
critical thinking. Differences in background beliefs can lead to
invalid and unfair assessments of critical thinking ability using
these tests. A methodology to lessen these problems was developed
using high school students taking trial versions of exams and
thinking aloud as they took the test. The examinees' verbal reports
were analyzed for critical thinking skills and assigned a numerical
score. A high correlation between critical thinking scores and
performance scores represents a high correspondence between thinking
critically and choosing the correct answer. Items with a low
correlation were revised. Although only a brief sketch of this
methodology could be presented in this paper, the relevance of verbal
reports of thinking to test construction has been suggested by
several testing specialists. Using verbal reports may be time
consuming, but the ideal of critical thinking is worth the effort.
Otherwise, only the worn-out and educationally indefensible emphasis
on memorization of factual information, rote recall, and pat answers
is left. (Twenty-eight references are appended.) (RS)
Abstract

This paper describes briefly a methodology for developing multiple-choice critical thinking tests which 
attempts to overcome certain problems of validity and fairness facing such tests. The concerns arise for
two reasons: (a) it is plausible that for many multiple-choice critical thinking tests it is not differences
in critical thinking ability but differences in other factors, such as examinees' background beliefs, that
account for most variance in test performance; and (b) there is no direct evidence to counter this
plausible hypothesis. The proposal is that such direct evidence be gathered during test development by
eliciting from samples of students verbal reports of their thinking as they work through trial items.
Items would be retained, modified, or discarded according to whether or not critical and uncritical 
thinking is related, respectively, to choosing keyed and unkeyed answers to the items. 
CONTROLLING FOR BACKGROUND BELIEFS
WHEN DEVELOPING MULTIPLE-CHOICE CRITICAL THINKING TESTS 

During the last decade there has been an increasing interest in teaching critical thinking (Follman, 
1987; Resnick, 1987). The interest is motivated, in part, by data showing that school children learn 
large amounts of information, but learn less well how to analyze, synthesize, and evaluate that 
information for their own use (National Assessment of Educational Progress, 1985; National
Commission on Excellence in Education, 1983). 

Concomitant with this growing interest in critical thinking instruction is an increasing desire to test for
critical thinking. In meeting this desire there is a heavy reliance on multiple-choice tests. However,
many people (e.g., McPeck, 1981; Petrie, 1986) claim that there are inherent flaws in multiple-choice
tests of critical thinking. One purported flaw is that such tests cannot be used to distinguish variance in
scores due to differences in those background beliefs of examinees which are not part of ability to think
critically from variance due to differences in critical thinking ability. For many existing multiple-choice
critical thinking tests, this criticism is well founded (Ennis & Norris, in press). 

If critical thinking assessment is to succeed, the problem of confounding background beliefs and critical
thinking ability when interpreting scores must be solved. In this paper I shall illustrate how this
problem with multiple-choice critical thinking tests can be lessened by using verbal reports of
examinees' thinking to help develop the tests.

In the first section, I show how multiple-choice testing of critical thinking can lead to a dilemma:
Adopting such testing can lead to unfair treatment of students, but disqualifying multiple-choice testing
of critical thinking can result in less powerful assessments of critical thinking. The second section
illustrates how multiple-choice critical thinking tests can lead to invalid and unfair assessment due to
differences in examinees' background beliefs. The third section describes a methodology for using
verbal reports of examinees' thinking on trial items to help avoid such invalidity and unfairness.

A Dilemma in Multiple-Choice Critical Thinking Testing

Theories of critical thinking, for instance that of Robert Ennis (1981), generally include standards and 
criteria for guiding thinking. Only thinking in accord with those standards and criteria is taken to be
critical thinking. Nevertheless, the standards and criteria are insufficient by themselves. Knowledge
and good judgment are also needed. When thinking about a complex problem, each individual draws
upon his or her own background beliefs and sense of good judgment, so each person is likely to reach
somewhat different solutions. Since the standards and criteria of critical thinking are not always
sufficient to define correct solutions, then more than one solution and approach might reflect critical 
thinking. 

The possibility of more than one good solution to a problem and of more than one good approach to
reaching a solution creates difficulties for multiple-choice tests of critical thinking. If the background
beliefs of some examinees are different from those of the examiner, then it is possible that, even though
the examinees follow the standards and criteria of critical thinking, they will be penalized because they
choose answers different from those judged good by the examiner. On the other hand, examinees
thinking uncritically might be rewarded merely because they choose the same solutions that the
examiner reached. 



The above possibilities jeopardize the validity and fairness of multiple-choice critical thinking tests.
Unfortunately, there is no easy solution to this problem. On the one hand, the goal of critical thinking
instruction is generally focused on teaching students how rather than what to think, because the critical
spirit demands that multiple perspectives be accepted (Paul, 1982; Siegel, 1980, 1988). Thus, opting for
test items with only one correct answer seems to introduce a validity-reducing factor into critical
thinking testing. On the other hand, multiple-choice items with one correct answer are one of the best
tools in certain evaluation situations, for instance, when the aim is to examine knowledge of the large
number of principles for judging the credibility of information or to assess many students. Is there a
way out of this dilemma? 

I believe there is a way to effect a compromise at the test development stage through the use of verbal
reports of students' thinking on trial test items. Before describing that methodology, I shall clarify how 
differences in background beliefs may lead to differences in performance on multiple-choice critical 
thinking tests. 

Differences in Background Beliefs as 
Explanations of Performance on Critical Thinking Tests 

The effect of differences in background beliefs will be illustrated using items from two commercially
available critical thinking tests. The discussion applies to any multiple-choice critical thinking tests
(with the possible exception of deduction tests), however, since they all are subject to similar sorts of
effects. 

Briefly, the argument to be made is that for many multiple-choice critical thinking tests variance in 
performance may be due more to differences in background beliefs than to differences in critical
thinking ability, and that at present there is little evidence to indicate the extent to which this may
occur. Consequently, there are several possible beneficial effects from taking this issue seriously
enough to reexamine existing multiple-choice critical thinking tests: (a) it could lead to empirical
results which either support the criticism or exonerate the criticized tests; (b) it could remove some of
the suspicion which diminishes people's confidence in such tests and thus jeopardizes their usefulness;
and (c) it can force us to alter our test development methodologies in ways that might save
multiple-choice tests for use in situations where they are eminently suitable.

The Watson-Glaser Critical Thinking Appraisal

The Watson-Glaser Critical Thinking Appraisal (Watson & Glaser, 1980a), first developed in the late
1930s, is one of the oldest and most widely used critical thinking tests. It has often served as a
benchmark for judging the validity of other critical thinking tests and for evaluating the effectiveness of
attempts to teach critical thinking. But the documentation available for the test (Watson & Glaser,
1980b) gives no direct evidence that variance in performance on the test is due primarily to differences
in critical thinking ability and not to other factors, such as differences in background beliefs which are
not part of critical thinking ability. Moreover, a plausible case can be made that several items test for
differences in background beliefs, not critical thinking.

Consider Item 6 as an example. For the item, examinees are to read a short passage. They are then 
given a statement and, on the basis of what they read in the passage, are to judge the statement either
True, Probably True, False, or Probably False, or to judge that there is Insufficient Data to make a 
choice on the truth or falsity of the statement. Here is the passage: 

Mr. Brown, who lives in the town of Salem, was brought before the Salem municipal
court for the sixth time in the past month on a charge of keeping his pool hall open
after 1 a.m. He again admitted his guilt and was fined the maximum, $500, as in each
earlier instance. 

Here is the statement to be judged: 

On some nights it was to Mr. Brown's advantage to keep his pool hall open after 1 
a.m., even at the risk of paying a $500 fine.



Norris 



Background Beliefs and Critical Thinking Testing - 4 



The answer keyed correct is "Probably True" which, according to the test instructions, means that it is 
more likely true than false that on some nights it was to Mr. Brown's advantage to keep his pool hall 
open after 1 a.m. But why is "Probably True" the keyed answer? There is no rationale provided in the
test manual. In addition, the manual provides no direct evidence that examinees choosing the keyed
answer generally think critically and that those choosing an unkeyed answer generally think uncritically.
But this is the relationship which must exist if the item is to differentiate among examinees on the basis
of their critical thinking ability.

In the absence of direct evidence, are there logical reasons for believing that the item works the way it
should? That is, is it plausible that generally when examinees choose the keyed answer they do so
because they have thought critically and that generally when they choose an unkeyed answer they have
thought uncritically? Plausibility in this case, I submit, is inversely proportional to the number of
plausible ways that have not been eliminated by evidence for examinees to think well and choose
unkeyed answers and to think poorly and choose the keyed response. The following argument, based
upon one by Norris and Ennis (in press), shows that the Watson-Glaser test cannot meet this standard
of plausibility.

Suppose an examinee recognized the possibility that Brown was not telling the truth when admitting
guilt. Maybe it was Brown's son who kept the pool hall open and, although he disagreed with his son's
action, Brown preferred to take the blame himself rather than see his son face the charges. On the
other hand, an examinee might think that Brown was a victim in a cover-up and that admitting guilt to
the charges was a way of channelling money to crooked municipal government
officials. Or, an examinee might think that Brown was telling the truth but was suffering from a severe
shock which provoked him to do things that were not to his advantage. Another examinee might
consider that Brown kept his pool hall open late to protest what he considered an unfair ordinance
which allowed only some establishments to remain open after 1 a.m. He did not think it was to his
advantage to protest, but did so on principle. 

If an examinee's background beliefs led him or her to assume any of the above possibilities, then the
examinee would be justified in choosing "Probably False" as the correct answer. If an examinee
thought of a number of these possibilities, but could not decide among them on the basis of the
information given, then that examinee would be justified in choosing "Insufficient Data" as the correct
answer. In both cases, the examinees would be marked wrong even though they thought well. But
because of the multiple-choice format we would not have known how well they thought.

There is no available evidence on which possibilities actually do come to the minds of examinees for
whom the Watson-Glaser test is designed (junior high school through college level). As a result,
examinees' choices of answers do not provide sufficient information for deciding whether or not they
are thinking critically. Therefore, if we are to use the test as designed, we must rely by default when
rating examinees' critical thinking on whatever reasoning led the test developers to choose "Probably
True" as the keyed response. We are not told what this reasoning is and there is no evidence of the
extent to which the reasoning of examinees who choose the keyed response matches that of the test 
developers. 

We know for sure that thinking critically can lead justifiably to different answer choices depending
upon the background beliefs used. But since there is no evidence on how examinees tend to think
when they reason through individual items on the test, and since there is no information on the
reasoning which supports the keyed responses to those items, there is no reason to believe that in
general when examinees choose keyed responses they think critically and when they choose unkeyed
responses they think uncritically. 

This criticism cannot be countered by arguing that background belief effects will wash out in the
averages over all items on the test, because many items on the test are subject to the same sort of
criticism. Nor can the criticism be countered by pointing to the large amount of correlation data
relating performance on the Watson-Glaser test to other variables (Watson & Glaser, 1980b). This
data provides evidence on the convergent and discriminant validity of the test, but it is not compelling
when the same data used to elaborate the nomothetic span of a test is also used to clarify the construct
the test measures in the first place (Embretson, Schneider, & Roth, 1986). This is especially true for
the construct of critical thinking. Conjectures about how critical thinking correlates with other
variables are quite untrustworthy, given the status of the theory of the construct. Therefore, inferring
whether or not a test measures critical thinking from how it correlates with other variables is doubly

Therefore, we do not know the extent to which examinees are unfairly penalized for choosing unkeyed
responses, even though they have thought critically, or the extent to which examinees are unfairly
rewarded for merely choosing keyed responses while using no critical thought at all. That is, the
prevalence of the problem is not known, because there has been no systematic investigation of it.

Test on Appraising Observations 

The Test on Appraising Observations (Norris & King, 1983) focuses on one aspect of critical thinking,
the ability to evaluate statements of observation. This focus on a single aspect distinguishes it from the
Watson-Glaser test, which examines several aspects of critical thinking. But the Test on Appraising
Observations is a multiple-choice test, so it is subject to the same sort of potential problems from
differences in examinees' background beliefs. Therefore, the development methodology of the Test on
Appraising Observations, which was designed to minimize these problems, is relevant to the
development of the Watson-Glaser and other multiple-choice critical thinking tests.

The test is intended to test knowledge of various principles for judging the credibility of reports
of observations. In Part A, items are cast in the context of a traffic accident. Witnesses and people
who were involved in the accident report what they observed happening. In each item, two underlined
reports are presented and the task is to decide which, if either, of the reports is more believable.

Consider Item 9. In it, Ms. Vernon and Martine, both witnesses to the accident and drivers of nearby 
but uninvolved cars, report on cars they saw going through a stop sign.

Ms. Vernon then says, "I also remember that a fancy blue sports car went through the
stop sign."

Martine says, "A car with twin headlights went right through the stop sign."

Examinees are told to choose between the underlined statements. The answer keyed correct is that
Vernon's observation is more believable, because being a fancy blue sports car is taken to be more
salient than having twin headlights. The item is intended to test knowledge of the Principle of
Observational Salience: Observations of more salient features of events tend to be more credible than
observations of less salient features. Features of events are salient to the degree that they are
extraordinary, colorful, interesting, and novel, and not salient to the degree that they are routine and
commonplace (Loftus, 1979; Nisbett & Ross, 1980).

Do examinees' choices of answers indicate their knowledge of the Principle of Observational
Salience? Consider an examinee who knows the critical thinking principle, but believes that having
twin headlights is a more salient feature than being a fancy blue sports car. Based on this belief, the
examinee would be justified in choosing Martine's statement as more believable. Or suppose an
examinee knows the principle, but believes that neither feature is more salient. That examinee is
justified in choosing that neither statement is more believable. Imagine two other examinees who know
the principle and believe that being a fancy blue sports car is more salient in the daytime, but that
having twin headlights is more salient at night. If one examinee imagines the situation to be
at night (justifiably choosing Martine's statement as more believable) and the other imagines it to be
day (justifiably choosing Vernon's statement), then one examinee will choose the keyed response and
the other an unkeyed response, even though both know the principle being tested.

The intent of the item is to differentiate between those who know and do not know the Principle of 
Observational Salience, not to differentiate between those who know and do not know whether being a 
fancy blue sports car or having twin headlights is more salient, or between those who imagine it is night 
and those who imagine it is day. However, on a multiple-choice item where only choice of answer is 
revealed, it is difficult to know on which basis differentiation among examinees is really being made.
This is the same problem faced by the Watson-Glaser and all other multiple-choice
critical thinking tests.

However, the methodology used to develop the Test on Appraising Observations, which is described in 
the following section, provides evidence that greater than 95% of the high school students for whom
the test is designed assume that the situation takes place during daytime. In addition, the methodology
provides evidence that fewer than 10% of high school students who know the principle being tested
assume that having twin headlights is a more salient feature than being a fancy blue sports car.
Furthermore, the evidence shows that there is a correlation of .87 between thinking critically on the
item and choosing the keyed response.

Similar evidence is available for each item on the test, although of course the numbers are not exactly
the same. Thus, in contrast to the Watson-Glaser and most other critical thinking tests, there is direct
evidence that the test differentiates primarily on the basis of differences in critical thinking and not
some other factors, such as differences in background beliefs.

A Methodology for Developing 
Multiple-Choice Critical Thinking Tests 

Trial versions of the Test on Appraising Observations were vetted by asking samples of students to 
think aloud as they worked on the items (Norris & King, 1984). Items were retained, modified, or
discarded according to whether or not it was critical thinking and not some other factors, such as
background beliefs, which was the major contributor to differences in scores. This procedure was
repeated with revised test versions until the average correlation between thinking critically and
choosing the keyed response was greater than .70 across all 50 items.¹

High school students took the trial versions in a one-on-one, tape-recorded interview format with one 
of the test developers. The interviews were conducted so as to be as non-leading as possible. The aim
was to try to elicit from students reasoning that was not different in substantive ways from the
reasoning they would have done had they taken the test in the normal paper-and-pencil format. The
interview approach has been shown subsequently not to change substantively the course of examinees'
thinking (Norris, in press).

First, the directions of the test were made clear to students. They were then asked to read the first
item aloud, to mark their answer choice on an answer sheet, and to say all that they were thinking as
they chose their answer. At this stage of the interview, the interviewer interrupted students only to
probe for ambiguous references and to check for reading errors. If students asked for additional
information they were told that no information other than that in the test could be given. When
students had finished talking about the item, the interviewer had the option to pose questions before
the students were asked to proceed to the next item. These questions were more leading and
requested the specific reasons for students' choices of answers when these reasons were not made clear
in what they had said. The procedure was repeated for subsequent items.

For a given trial version of the test, about 50 students were interviewed. Each student was asked to 
think aloud on about one-fourth of the items and to do the remaining items in a paper-and-pencil
sitting. Thus, for any item, from 12 to 15 verbal reports of thinking were obtained. These reports were
transcribed and then analyzed for the quality of the critical thinking they portrayed. The analysis
involved studying each student's report and assigning a Thinking Score from 0 to 3 for each item to
indicate quality of the student's thinking. The thinking score for each item was based on the degree to
which the student's thinking matched an ideal model of thinking on that item. Thinking scores were
assigned independently of the answer chosen, so it was possible for a student to obtain a high thinking
score and choose an unkeyed answer, or to obtain a low thinking score and choose the
keyed answer.
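As a rough check on the sampling arithmetic in this paragraph, the following Python sketch rotates students through four interleaved blocks of items and counts the think-aloud reports each item would receive. The round-robin assignment, the function name `reports_per_item`, and the figures of exactly 50 students and 50 items are assumptions made for illustration; the text says only that about 50 students each thought aloud on about one-fourth of the items.

```python
def reports_per_item(n_students, n_items, n_blocks=4):
    """Count think-aloud reports per item under a hypothetical rotation.

    Items are split into n_blocks interleaved blocks, and student s
    thinks aloud on block (s mod n_blocks). This rotation is an
    assumption, not the procedure documented by the test developers.
    """
    blocks = [list(range(b, n_items, n_blocks)) for b in range(n_blocks)]
    counts = {item: 0 for item in range(n_items)}
    for student in range(n_students):
        for item in blocks[student % n_blocks]:
            counts[item] += 1
    return counts


counts = reports_per_item(n_students=50, n_items=50)
print(min(counts.values()), max(counts.values()))
```

Under these assumptions each item receives 12 or 13 reports, which is consistent with the lower end of the 12 to 15 reported.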

Thus, for each item on the test there were two sets of scores. The set of thinking scores (0s, 1s, 2s, or
3s) represented the quality of each student's verbal report of thinking on that item. The set of
performance scores (0s or 1s) represented whether or not the students had chosen the keyed answer.
These two sets of scores were correlated. A high correlation represents a high correspondence
between thinking critically or uncritically on a given item and choosing, respectively, the keyed response
or unkeyed response. Thus, the correlations provide direct evidence of how well items are working.
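The correlation step for a single item can be sketched in a few lines of Python. This is a minimal illustration and not the developers' actual procedure: it computes an ordinary Pearson coefficient (which equals the point-biserial coefficient when one variable is scored 0/1), whereas the footnote reports that a Thinking/Performance Index was used in practice. The scores for the twelve students are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data for one item: thinking scores (0-3) from the verbal
# reports and performance scores (1 = keyed answer chosen, 0 = unkeyed).
thinking = [3, 3, 2, 3, 0, 1, 2, 0, 3, 1, 0, 2]
performance = [1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

r = pearson(thinking, performance)
print(round(r, 2))
```

In this invented sample the students who thought well almost always chose the keyed answer, so the coefficient is high; an item on which good thinkers often chose unkeyed answers would yield a low or negative value.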

When correlations were low, items were revised and retried using the same format. The verbal reports 
were very useful in making these revisions, because they often made quite clear why an item was not
working as desired: There might have been an ambiguity in wording; students might not have
understood what a particular expression meant; or students might have used background beliefs not
part of their critical thinking ability which were different from those used by the test developers in
choosing the keyed response.
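The revision cycle described in this section can be sketched as a simple triage over per-item correlations. The sketch below is hypothetical: the function name, the per-item cutoff of .50 for "low," and the sample correlations are assumptions introduced for illustration. Only the stopping target, an average correlation greater than .70 across items, comes from the text.

```python
def triage_items(item_correlations, low_cutoff=0.50, target_mean=0.70):
    """Flag items for revision and check the stopping rule.

    item_correlations maps an item number to its thinking/performance
    correlation. low_cutoff is an assumed threshold for 'low'; the
    source does not state one. target_mean is the .70 average from the text.
    """
    to_revise = sorted(
        item for item, r in item_correlations.items() if r < low_cutoff
    )
    mean_r = sum(item_correlations.values()) / len(item_correlations)
    return to_revise, mean_r, mean_r > target_mean


# Invented correlations for four trial items.
trial = {1: 0.87, 2: 0.31, 3: 0.74, 4: 0.66}
to_revise, mean_r, done = triage_items(trial)
print(to_revise, round(mean_r, 3), done)
```

Here item 2 would be sent back for revision, with the verbal reports consulted to diagnose why it failed, and the trial version as a whole would not yet meet the stopping rule.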

An example of changes made to the directions will illustrate how the methodology worked. Recall that
items on the test are cast in the context of a traffic accident. In one version of the test, the directions
contained the names of all characters and what they were doing when the accident occurred. The
rationale was that the list of characters and roles would help examinees keep the information straight.
But for that test version the correlation between thinking and performance scores for the first several
items was too low. The verbal reports showed that many examinees used the information in the
directions to answer these questions. To illustrate, in one item, two characters gave conflicting reports
about how many cars were at the intersection. One character was more alert and therefore a more
credible witness. Coincidentally, that character reported that there were three cars at the intersection,
the same number that the directions said were involved in the accident. The verbal reports showed that
several examinees did not consider the alertness of the witnesses or any other relevant feature of the
situation, but simply cited the number of cars mentioned in the directions as their answer. These
students equated uncritically the number of cars at the intersection with the number involved in the
accident, but nevertheless chose the keyed response. This problem with the test was made quite
prominent by the verbal reporting methodology.

Discussion and Concluding Remarks 

I have given only a brief sketch of how the use of verbal reports of thinking on trial test versions can 
help provide evidence on the validity of multiple-choice critical thinking tests. The relevance of verbal
reports of thinking to test construction has been suggested by several testing specialists (e.g., Anastasi,
1988; Cronbach, 1971; Haney & Scott, 1987; Messick, in press), but the technique has been used rarely
(Norris, in press). However, verbal reports of thinking are particularly relevant to the development of
multiple-choice critical thinking tests, because satisfying the purposes of such tests demands direct
evidence on the thinking processes students follow when taking them. In tests designed to distinguish
students who know certain pieces of factual information from those who do not, knowing the
thinking processes of examinees might not be crucial. But when thinking processes are the focus of the
evaluation, inferences about students' abilities based merely on the answers they choose tend to be
untrustworthy. 

The verbal report methodology does have some shortcomings. First, there is the problem of
generalizing to examinees other than those whose verbal reports were obtained. Generalizability is
always a problem in research, but we do not know the extent of the problem for critical thinking
testing. In particular, there is no good evidence on the extent to which different subgroups bring
different background beliefs to bear on the same problems. Thus, it is still not known how much of a
solution the proposed methodology effects. For example, only high school students were involved in
the verbal report studies of the Test on Appraising Observations. So we do not know how valid the test
is for junior high school students, college students, or high school students from different places.

What are the alternatives to the standardized multiple-choice tests currently available? One alternative
is to avoid multiple-choice testing altogether, maybe by using constructed-response tests which require
essays or short answers. In their essays and short answers, examinees might reveal the background
beliefs upon which their thinking is based. Examiners could then take this information into account in
making judgments about examinees' levels of critical thinking. However, constructed-response testing
does not provide complete assurance. There is no window into examinees' brains which shows all they
are thinking or all of the basis for their decisions. Examinees likely do not know all of these things
themselves (Nisbett & Wilson, 1977). In addition, constructed-response testing raises other concerns
such as low interrater reliability and the inability to adequately cover a wide range of critical thinking
abilities and dispositions in a reasonable time (Norris, 1986). If, for example, evaluation of examinees'
ability to appraise observations is the concern, then it is difficult to imagine how knowledge of all the
principles of such appraisal could be assessed in a constructed-response test of reasonable length.

Another alternative is to mix standard multiple-choice formats with other formats. For example,
multiple-choice items could be supplemented by asking examinees to provide reasons for the answers
they choose. So as not to turn a multiple-choice test entirely into a constructed-response test, reasons
could be requested for only some of the items. Such an approach would provide an indication of the
background beliefs examinees were using in their thinking and the examiner could take these beliefs
into account in assessing their critical thinking.

A third alternative might be to base critical thinking tests only on school subject matter which students
are supposed to have studied, instead of on general knowledge as found in most current tests.
Doubtless, not all students will have learned the particular body of knowledge to the same degree, so
differences in background beliefs will continue to cause variance in scores. However, the influence of
differences in background beliefs may be lessened and, even if not, it would not pose the same issues of
fairness that arise when using critical thinking tests based upon general knowledge. If students perform
poorly on a critical thinking test in science because they have not learned the required science content,
then, barring poor instruction or other extenuating circumstances, it is not unfair to mark them down.
However, there remains the question whether critical thinking or science content was being tested.
This validity issue would still need to be addressed.

In addition, critical thinking testing based on school subject matter may not be the best way to
determine whether critical thinking has generalized to problems outside of school subjects. Much of
the justification for teaching critical thinking is based on an expected generalizability to everyday life
situations outside the school and school subject matter, so it is important to test for how much this 
expectation is realized. 

Minimizing problems arising from differences in background beliefs when testing for critical thinking
should be an important concern for those involved in critical thinking evaluation. First, there is the
concern for validity. If scores on tests are to be interpreted as measures of critical thinking ability, then
it is necessary to reduce as much as possible the effects of differences in background beliefs. Second,
there is a concern for fairness and, as Messick (1975) argued over a decade ago, validity and fairness go
hand in hand. If we believe our critical thinking tests are valid, but we are really differentiating among
students on the basis of background beliefs unrelated to critical thinking, then there is a risk of treating
them unfairly. There is a risk of unfairly penalizing students who are thinking critically but who do not
have the appropriate background beliefs, and a risk of unfairly rewarding students who are not thinking
critically but who, nevertheless, have those background beliefs which enable them to perform correctly.

The desire to teach critical thinking places many new demands on educators. One new demand is that 
test development practices will need alteration in order to allow for the diversity of opinion and
approaches to problems which critical thinking encourages. The alterations may include the adoption
of time-consuming development methodologies such as the one described here. But the ideal of critical
thinking is worth the effort. Otherwise, we are left with the worn-out and educationally indefensible
emphasis on memorization of factual information, rote recall, and pat answers.
Footnote

¹Correlation coefficients were not actually used. The biserial correlation coefficient was the
most suitable estimate of correlation given the nature of the data, but it is subject to distortion from a
variety of factors. Consequently, some correlations were greater than 1.0 and, hence, not interpretable.
A statistic, called a Thinking/Performance Index, was developed and used in place of the correlations.
The T/P Index is a measure of the net positive evidence available for an item from the interview data.
It is described more fully in Norris and King (1984).
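One way a biserial coefficient can exceed 1.0 is shown in the sketch below, which applies the standard textbook formula r_b = ((M1 - M0) / s) * (p * q / y), where M1 and M0 are the mean thinking scores of those who did and did not choose the keyed answer, s is the standard deviation of all thinking scores, p and q are the two proportions, and y is the standard normal density at the point dividing p from q. The footnote does not say how the coefficient was computed in the study, so the formula choice, the function name, and the eight invented scores are all assumptions.

```python
from statistics import NormalDist, mean, pstdev


def biserial(scores, keyed):
    """Textbook biserial correlation between scores and a 0/1 variable."""
    chose_keyed = [x for x, k in zip(scores, keyed) if k == 1]
    chose_other = [x for x, k in zip(scores, keyed) if k == 0]
    if not chose_keyed or not chose_other:
        raise ValueError("both answer groups must be non-empty")
    p = len(chose_keyed) / len(scores)
    q = 1.0 - p
    std_normal = NormalDist()
    # Height of the standard normal curve at the cut between p and q.
    y = std_normal.pdf(std_normal.inv_cdf(p))
    return (mean(chose_keyed) - mean(chose_other)) / pstdev(scores) * (p * q / y)


# Invented extreme case: every keyed answer came with a thinking score
# of 3 and every unkeyed answer with a thinking score of 0.
thinking = [3, 3, 3, 3, 0, 0, 0, 0]
keyed = [1, 1, 1, 1, 0, 0, 0, 0]

r_b = biserial(thinking, keyed)
print(round(r_b, 3))
```

The formula assumes a normally distributed trait underlying the 0/1 variable. With coarse scores on a 0 to 3 scale and a dozen or so reports per item, that assumption is easily violated, and the coefficient can rise above 1.0 as it does here.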